Exploration of mRNAs and miRNA classifiers for various ATLL cancer subtypes using machine learning

Background Adult T-cell Leukemia/Lymphoma (ATLL) is a cancer that develops as a result of infection with human T-cell leukemia virus type 1. It is classified into four main subtypes: acute, chronic, smoldering, and lymphoma. Despite their distinct clinical manifestations, there are no reliable diagnostic biomarkers for classifying these subtypes. Methods We employed a machine learning approach, Support Vector Machine-Recursive Feature Elimination with Cross-Validation (SVM-RFECV), to classify the different ATLL subtypes from Asymptomatic Carriers (ACs). The expression values of multiple mRNAs and miRNAs were used as features. Afterward, reliable miRNA-mRNA interactions for each subtype were identified by exploring the experimentally validated target genes of the miRNAs. Results The results revealed that miR-21 and its interactions with DAAM1 and E2F2 in the acute, SMAD7 in the chronic, and MYEF2 and PARP1 in the smoldering subtype could significantly classify the diverse subtypes. Conclusions Considering the high accuracy of the constructed model, the identified mRNAs and miRNA are proposed as potential therapeutic targets and prognostic biomarkers for the various ATLL subtypes. Supplementary Information The online version contains supplementary material available at 10.1186/s12885-022-09540-1.


Background
Adult T-Cell Leukemia/Lymphoma (ATLL) is a cancer that develops as a result of infection with Human T-Cell Leukemia Virus type 1 (HTLV-1). It is an aggressive malignancy of CD4+ T lymphocytes [1]. Infection with HTLV-1 can lead to two main diseases: ATLL and HTLV-1-Associated Myelopathy/Tropical Spastic Paraparesis (HAM/TSP).
HTLV-1 is endemic in several regions, including northeastern Iran, parts of South America, the Caribbean, and Japan, and infects more than 20 million people worldwide. ATLL develops in about 5% of infected individuals after a long latency period, during which they are referred to as Asymptomatic Carriers (ACs) [2].
The two main viral proteins, the transactivating protein Tax-1 and the HTLV-1 bZIP (basic zipper) factor (HBZ), have critical roles in disease development. Tax-1 is implicated in the transformation and proliferation of infected T cells. However, ATLL cells often lose Tax expression because of epigenetic and genetic alterations in the proviral genome, whereas HBZ supports the proliferation of ATLL cells [3,4].

*Correspondence: mohadesehzaree@gmail.com; r.emamzadeh@sci.ui.ac.ir. 1 Department of Cell and Molecular Biology and Microbiology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran. Full list of author information is available at the end of the article.
ATLL is categorized into four main subtypes according to the Shimoyama classification: acute, chronic, smoldering, and lymphoma [5,6]. The acute and lymphoma subtypes are characterized by aggressive behavior and poor prognosis, whereas the chronic and smoldering subtypes have an indolent clinical course and different clinicopathologic features. Hepatosplenomegaly and elevated lactate dehydrogenase are observed in the acute type and, less frequently, in the lymphoma type [7]. In addition, the acute type is identified by abnormal lymphocytes circulating in the peripheral blood. The chronic subtype usually causes leukocytosis with absolute lymphocytosis, skin rash, hypercalcemia, and moderate lymphadenopathy [8,9]. The smoldering subtype is asymptomatic and is characterized by less than 5% circulating irregular lymphoid cells without organomegaly or hypercalcemia [10].
Several studies have explored the possible pathogenesis mechanisms by which HTLV-1 infection in ACs progresses toward ATLL and/or HAM/TSP [2, 11-15]. However, some of them considered ATLL as a whole and disregarded its subtypes. In addition, the subtypes of ATLL have a poor prognosis due to inherent chemoresistance and intense immunosuppression. Moreover, the manifestations and course of the disease are heterogeneous [16]. Therefore, computational classification methods could be beneficial for identifying the subtypes of ATLL with high accuracy and for selecting among the conventional treatments.
In this investigation, we utilized a machine learning method to classify three subtypes of ATLL. This led to the identification of powerful mRNA and miRNA classifiers that separate these subtypes from ACs. The identified classifiers could help to determine the pathogenesis routes from HTLV-1 infection toward the development of each ATLL subtype.

Dataset collection and preprocessing
We downloaded four microarray datasets from the Gene Expression Omnibus (GEO) repository. The datasets GSE55851 [17] and GSE33615 [18] contain gene expression in the whole blood or the Peripheral Blood Mononuclear Cells (PBMCs) of three subtypes: acute, chronic, and smoldering.
The datasets GSE29332 [19] and GSE29312 [19] include gene expression in the PBMCs of ACs. A total of 29 acute, 23 chronic, and 10 smoldering ATLL subjects, as well as 37 AC samples, covering 15,565 common genes, were used for further analysis. Moreover, to find the miRNA classifiers, the datasets with the accession numbers GSE46345 [20] and GSE31629 [18] were employed. They contain the miRNA expression of ACs and ATLL subjects. A total of 12 AC and 40 ATLL samples, including the expression of 549 miRNAs, were involved in the analysis. The characteristics of the datasets are specified in Table 1. To remove the batch effect among the datasets, the removeBatchEffect function of the limma package was employed [21]. The data were randomly divided into train and test sets in Python (65% train, 35% test).
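The 65/35 random split described above could be sketched as follows. This is a minimal illustration using scikit-learn's `train_test_split`; the text only states that the split was done in Python, so the library, the stratification by class, the random seed, and the synthetic expression matrix are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for one subtype-vs-AC expression matrix
# (rows = samples, columns = genes); real values would come from GEO.
n_atll, n_ac, n_genes = 29, 37, 50  # 29 acute and 37 AC samples, as in the text
X = rng.normal(size=(n_atll + n_ac, n_genes))
y = np.array([1] * n_atll + [0] * n_ac)  # 1 = ATLL subtype, 0 = AC

# 65% train / 35% test; stratify and random_state are assumed settings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, stratify=y, random_state=0
)
```

Stratifying keeps the ATLL-to-AC ratio similar in both sets, which matters for groups as small as the 10 smoldering samples.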

Support vector machine-recursive feature elimination with cross-validation (SVM-RFECV)
To determine the specific features that can classify the various ATLL subtypes, SVM-RFECV based on tenfold cross-validation was employed [22]. RFE is a wrapper variable selection approach that relies on an internal, filter-based ranking of the variables. SVM-RFE is essentially a backward elimination procedure, in which the top-ranked features are the most relevant variables conditional on the subset of variables currently ranked in the model. The top-ranked features in the final iteration of SVM-RFE are the most informative variables, whereas the bottom-ranked features are the least informative ones and can be removed [23]. SVM-RFECV comprises five steps: 1) training an SVM on the train set with tenfold cross-validation; 2) ordering the variables by the weights of the obtained classifier; 3) eliminating the variables with the smallest weight; 4) updating the training dataset according to the chosen variables; 5) repeating the steps with the training set limited to the remaining variables [24]. We implemented the SVM-RFECV algorithm in Python 3.9.
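The five steps above map closely onto scikit-learn's `RFECV` class, which may or may not be the implementation the authors used. The sketch below assumes a linear-kernel SVM (needed so that per-feature weights are available), elimination of one feature per iteration, and synthetic data in place of the expression matrices; none of these settings is stated in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic data standing in for an expression matrix of candidate genes
X, y = make_classification(
    n_samples=66, n_features=40, n_informative=5,
    n_redundant=0, random_state=0
)

# A linear kernel exposes per-feature weights (coef_) used for the ranking
svm = SVC(kernel="linear", C=1.0)

selector = RFECV(
    estimator=svm,
    step=1,  # assumed: drop the single lowest-weight feature per iteration
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of the retained features
print("number of selected features:", selector.n_features_)
print("selected feature indices:", selected)
```

After fitting, `selector.ranking_` assigns rank 1 to every retained feature and higher ranks to features eliminated earlier, which corresponds to the top-ranked and bottom-ranked variables described above.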

Identification of differentially expressed genes (DEGs)
To determine the differentially expressed genes between each ATLL subtype and the AC samples, the limma package in the R programming environment was employed [25].
Benjamini-Hochberg FDR-adjusted p-values < 0.05 and an absolute log fold change (|logFC|) cutoff of 5 were chosen as the criteria for selecting significant DEGs.
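The selection criteria can be illustrated with a short NumPy sketch of the Benjamini-Hochberg adjustment followed by the two filters. The authors performed this step with limma in R; the Python version below is only an illustration, the function names are hypothetical, and treating the fold-change criterion as |logFC| >= 5 is an assumption about how the cutoff was applied.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

def select_degs(log_fc, p_values, fdr=0.05, lfc_cutoff=5.0):
    """Boolean mask of genes passing both criteria (>= is an assumption)."""
    adj = benjamini_hochberg(p_values)
    return (adj < fdr) & (np.abs(np.asarray(log_fc, dtype=float)) >= lfc_cutoff)

# Toy values, not taken from the study
p_vals = [0.01, 0.04, 0.03, 0.005]
log_fc = [6.2, -5.5, 1.0, -7.1]
adj_p = benjamini_hochberg(p_vals)
mask = select_degs(log_fc, p_vals)
```

In this toy input the third gene has a significant adjusted p-value but fails the fold-change filter, so only the other three are retained.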

Determination of target genes of miRNAs
To find the experimentally validated target genes of the miRNAs, the miRTarBase database [15,26] was used. The miRNA-target gene network was visualized with Cytoscape 3.6.1.

Pathway enrichment analysis
For the pathway enrichment analysis of the identified classifier genes for each subtype, the ToppGene database was employed [27]. Terms with an adjusted p-value < 0.05 were considered statistically significant.

Determination of DEGs
A total of 5327, 5525, and 5185 DEGs were found between ACs and ATLL_acute, ATLL_chronic, and ATLL_smoldering, respectively (Supplementary data file 1). Afterward, the unique DEGs belonging to each subtype were explored. The Venn diagram shows 521, 594, and 187 unique DEGs for ATLL_chronic, ATLL_acute, and ATLL_smoldering, respectively (Fig. 1). These DEGs were considered the selected variables for each subtype (Supplementary data file 2). Accordingly, matrices containing the expression values of the selected features for each sample were constructed for machine learning.
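The extraction of subtype-specific DEGs, as summarized by the Venn diagram, amounts to a set difference between each subtype's DEG list and the union of the others. A minimal sketch follows; the helper name and the placeholder gene identifiers are hypothetical and are not taken from the study.

```python
def unique_per_group(deg_sets):
    """Return, for each group, the DEGs that appear in no other group."""
    unique = {}
    for name, genes in deg_sets.items():
        # Union of the DEG sets of all other groups
        others = set().union(*(g for n, g in deg_sets.items() if n != name))
        unique[name] = genes - others
    return unique

# Placeholder gene identifiers (hypothetical)
deg_sets = {
    "ATLL_acute": {"g1", "g2", "g3", "g4"},
    "ATLL_chronic": {"g2", "g3", "g5"},
    "ATLL_smoldering": {"g3", "g6", "g7"},
}
unique = unique_per_group(deg_sets)
```

The resulting per-subtype gene sets would then define the columns of the expression matrices passed to the classifier.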

Classification of ATLL subtypes using SVM-RFECV
The SVM-RFECV analysis was utilized to find the features that could classify the various ATLL subtypes from ACs. For this purpose, the unique DEGs for each subtype were used as the input features; the results are reported in Table 2. A total of 27, 9, and 32 genes were found as the best classifiers for ATLL_acute, ATLL_chronic, and ATLL_smoldering, respectively. Furthermore, the confusion matrices and the classification reports for the test sets are visualized in Fig. 2a-f. The results showed that the selected features could significantly classify the various subtypes from ACs. The accuracy for the test set was 1.00, 0.95, and 0.95 for ATLL_acute, ATLL_chronic, and ATLL_smoldering, respectively. To find the pathways activated by the gene classifiers of each subtype, a pathway enrichment analysis was performed. The involvement of each gene in each pathway and the previously reported functions of the genes in ATLL progression are given in Supplementary data file 3.
The gene classifiers for ATLL_acute were enriched in Glutathione metabolism, Urea cycle and the metabolism of amino groups, beta-Alanine metabolism, Cysteine and methionine metabolism, sulfate activation for sulfonation, CXCR4-mediated signaling events, Metabolism of polyamines, Amino Acid metabolism, Metabolic pathways, Pathways in cancer, Hypoxia and p53 in the Cardiovascular system, Interferon Signaling, the planar cell polarity Wnt signaling, Noncanonical Wnt signaling pathway, and Expression of cyclins regulates progression through the cell cycle by activating cyclin-dependent kinases.
In addition, the gene classifiers for ATLL_chronic were enriched in tRNA modification in the nucleus and cytosol, TGFbeta Receptor Signalling in Skeletal Dysplasias, tRNA processing, altered transforming growth factor-beta Smad dependent signaling, Cell to Cell Adhesion Signaling, CD40L Signaling Pathway, Cytokine Signaling
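For readers who want to reproduce the kind of test-set evaluation summarized by the confusion matrices and classification reports, the sketch below shows how these quantities can be computed with scikit-learn. The labels are illustrative only and are not the study's predictions; they are chosen so that one of 20 samples is misclassified.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Illustrative labels only (1 = ATLL subtype, 0 = AC); not the study's data
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Rows are true classes and columns are predicted classes, in the order [AC, ATLL]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, target_names=["AC", "ATLL"])

print(cm)
print(report)
```

With small test sets, a single misclassified sample shifts the accuracy by several percentage points, so the per-class precision and recall in the report are worth reading alongside the overall accuracy.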

Finding miRNA-gene classifier between ATLL subtypes and ACs
As there are no reliable datasets for investigating miRNA expression across the ATLL subtypes, we considered miRNA expression in ATLL as a whole. The SVM-RFECV analysis revealed miR-21 as the best miRNA, with an accuracy of 100% for classifying ATLL from ACs. The confusion matrix and classification report are depicted in Fig. 3a, b. The target genes of miR-21 were then retrieved from the miRTarBase database (Supplementary data file 4). Next, the genes shared between the miR-21 targets and the classifier genes of each subtype were identified. As a result, DAAM1 and E2F2 in the acute, SMAD7 in the chronic, and MYEF2 and PARP1 in the smoldering subtype were specified (Fig. 4).
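The overlap step described above is a set intersection between the miR-21 target list and each subtype's classifier genes. In the sketch below, the five named genes come from the text, while `TARGET_X`, `GENE_A`, `GENE_B`, and `GENE_C` are hypothetical placeholders standing in for the remaining entries of the real lists.

```python
# miR-21 targets: the five genes named in the text plus a hypothetical placeholder
mir21_targets = {"DAAM1", "E2F2", "SMAD7", "MYEF2", "PARP1", "TARGET_X"}

# Classifier genes per subtype: named genes from the text plus placeholders
classifiers = {
    "acute": {"DAAM1", "E2F2", "GENE_A"},
    "chronic": {"SMAD7", "GENE_B"},
    "smoldering": {"MYEF2", "PARP1", "GENE_C"},
}

# Genes that are both miR-21 targets and subtype classifiers
shared = {
    subtype: sorted(genes & mir21_targets)
    for subtype, genes in classifiers.items()
}
```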

Discussion
ATLL is considered one of the most aggressive T cell non-Hodgkin lymphoma variants. Four clinical variants of ATLL have been specified: acute, lymphoma-type (lymphomatous), chronic, and smoldering. Shimoyama's criteria are of limited use for classifying some patients in the absence of precise immunophenotypic and clonal analysis of the peripheral blood [28]. For example, HTLV-1 carriers without ATLL can have up to 5% circulating atypical cells, which can lead clinicians to classify lymphomatous ATLL with circulating atypical cells as acute. Moreover, it has been reported that ATLL patients in different regions respond differently to the available therapies. For instance, first-line zidovudine plus interferon-α (AZT-IFN) can be beneficial for aggressive leukemic ATLL patients in the United States [28]. Moreover, AZT-IFN is a first-line choice for patients with non-bulky, non-lymphomatous aggressive ATLL. It can also be the best choice for patients with chronic-type ATLL. On the other hand, chemotherapy is the preferred option for the lymphomatous type. In Latin America, an etoposide-based regimen is favored for patients with aggressive ATLL, whereas AZT-IFN is a good first-line choice for the acute subtype [29].
A recent study on Japanese patients reported an unsatisfactory prognosis for the acute ATLL type and a worse prognosis for the smoldering type [30]. Accurate classification of ATLL subtypes could therefore support the choice of proper treatments. ATLL subtypes could be categorized into molecularly distinct subsets with different prognoses. Moreover, genetic profiling could contribute to better management and prognostication of ATLL patients [31]. Each ATLL subtype can carry diverse genomic alterations and follow a different clinical course. In a recent study, the total structural variations, mutations, driver alterations, and abnormal copy number (CN) segments were explored in the aggressive (acute) and the indolent (chronic and smoldering) subtypes [32]. In this study, we concentrated on the expression values of coding and non-coding RNAs. We applied support vector machine-recursive feature elimination as a machine learning approach to classify the ATLL subtypes from AC samples. Then, we identified potential prognostic targets.
Acute ATLL involves lymphoma cells that persist in the blood. The main characteristic of this subtype is its aggressive biology, with a median survival of only 4-6 months. The disease progresses rapidly in the bones, skin, lymph nodes, spleen, and liver. DAAM1 and E2F2 are two specific classifier genes for acute ATLL. DAAM1 encodes a protein that contains two FH domains and belongs to the FH protein subfamily, with a role in cell polarity. It likely acts as a scaffolding protein for the Wnt-induced assembly of a Dishevelled (Dvl)-Rho complex. It also boosts the nucleation and elongation of new actin filaments and regulates cell growth through the stabilization of microtubules. Moreover, it has been shown that DAAM1 can promote the migration and invasion of cancerous cells. It can also promote tumor advancement in hepatocellular carcinoma as well as in breast and ovarian cancers [33-35].
The E2F2 protein is a transcription factor that has a substantial function in controlling the cell cycle and the action of tumor suppressor proteins. It is also a target of the transforming proteins of small DNA tumor viruses [36]. In particular, E2F2 binds to RB1 in a cell-cycle-dependent manner. RB1 mediates the control of the cell cycle by binding E2F2 and suppressing expression from E2F2-dependent promoters. It is concluded that E2F2 and DAAM1 could be considered for the prognosis of the acute ATLL subtype.
The chronic subtype of ATLL is characterized by slow growth and affects the lungs, skin, lymph nodes, spleen, and liver. An elevated number of T cells and lymphocytes in the blood is a sign of this subtype. SMAD7, the classifier gene identified for this subtype, encodes a nuclear protein that binds the E3 ubiquitin ligase SMURF2. After binding, the complex translocates to the cytoplasm, where it can interact with TGFBR1, resulting in the degradation of both the encoded protein and TGFBR1. A relationship between the expression of SMAD7 and lymphatic metastasis in gastric cancer has been reported [37]. Moreover, the survival of cancer cells and apoptosis were reported to be induced after SMAD7 transduction. The upregulation of SMAD7 inhibits proliferation, boosts apoptosis, and inactivates Smad signaling [38].
Smoldering ATLL, like the chronic subtype, grows slowly and affects the lungs or skin. It causes unusual T cell counts in the blood. MYEF2 and PARP1 are the two classifier genes that we identified for the smoldering subtype. MYEF2 is the myelin expression factor 2, which acts as a transcription suppressor of the myelin basic protein (MBP). MYEF2 is a downstream target that is modulated by the Wnt/β-catenin pathway. The genes regulated by Wnt/β-catenin can help to identify the pathogenesis mechanisms of cancer and possible therapies [39]. Furthermore, a possible role of MYEF2 in carcinogenesis has been proposed; however, its function in cancer is still unknown and should be evaluated in further studies.
Fig. 4 The miR-21-gene target interaction for various ATLL subtypes
PARP1 encodes a chromatin-associated enzyme, poly (ADP-ribosyl) transferase, which modifies several nuclear proteins by poly (ADP-ribosyl)ation. The modification depends on DNA and is implicated in the regulation of several important cellular processes, such as proliferation and tumor transformation. It is also involved in the regulation of the molecular events underlying cell recovery from DNA damage [40].
PARP1 is a coactivator for the HTLV-1 transcription activator Tax. It forms active complexes on the promoter [41]. Furthermore, the expression of PARP1 is related to a progressive course of indolent mantle cell lymphoma. Therefore, it was proposed that PARP1 could be used as a negative predictor in initial diagnostic studies [42].
Moreover, SVM-RFECV was employed to find a promising miRNA classifier. MiR-21 was identified as the best classifier between ATLL and ACs. It is involved in the acceleration of tumorigenesis and the onset of some tumor types [43]. In addition to the above-mentioned genes, it can target many other genes that are involved in cancer and tumor progression. Therefore, its function should be studied within a complex network of genes and together with the effect of other miRNAs.
Our study has some limitations. The chronic type is known to be divided into favorable and unfavorable types based on laboratory findings. The unfavorable chronic type, like the acute type, is regarded as aggressive ATLL. There are no expression data for these two groups, so we had to consider chronic ATLL as a whole, without subgrouping. Moreover, the identified classifiers should be experimentally validated in a large cohort containing samples from the various ATLL subtypes.

Conclusion
In summary, we identified mRNA and miRNA classifiers that could accurately distinguish the various ATLL subtypes from ACs. The promising classifiers were SMAD7 in the chronic, MYEF2 and PARP1 in the smoldering, and DAAM1 and E2F2 in the acute subtype. Moreover, miR-21 classified ATLL from ACs. However, further studies should be carried out to validate these classifiers experimentally.